In this project, you will create a linear regression model for predicting total post-index cost and a logistic regression model for predicting the PDC flag for diabetic patients.
The first step in creating the models is variable selection. As you select your variables, keep in mind that smaller models are easier to interpret and less prone to overfitting.
Utilize scatter plots and box plots, statistical tests such as the chi-square test, t-test, Wilcoxon rank-sum test, F-test, and ANOVA, and stepwise regression techniques to assist in variable selection. You can also compare the performance of different models on the test set by measuring the out-of-sample error.
For the linear regression model, you will need to perform residual analysis to check model assumptions.
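The out-of-sample comparison mentioned above can be sketched as follows. Synthetic data stands in for the project's features and post-index cost here; the idea is simply to fit candidate models on a training split and score them on a held-out test split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 5))
# Only the first two columns actually drive the response
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

full = LinearRegression().fit(X_train, y_train)          # all five features
small = LinearRegression().fit(X_train[:, :2], y_train)  # two-feature model

mse_full = mean_squared_error(y_test, full.predict(X_test))
mse_small = mean_squared_error(y_test, small.predict(X_test[:, :2]))
print(mse_full, mse_small)
```

The smaller model gives up nothing here because the extra columns carry no signal, which is the sense in which smaller models are less prone to overfitting.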
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%cd ~/Downloads/
Data = pd.read_csv("UOP_data_extract _matched_w_key.csv", skiprows=1)
len(Data)
There shouldn't be much work in terms of cleaning, but there are certainly features we can drop that won't help with either model. For instance, any post-index measures can be dropped: they are taken at the same time as the post-index cost, so they can't help with making predictions before the cost is incurred, and they can't help with assigning the PDC flag because the PDC level is calculated before any post-index measures occur.
We can also drop NaNs because some models, such as logistic regression, won't run if the features contain missing values.
#Split by drug class and keep only the antidiabetic cohort
AC = Data[Data.drug_class == '*ANTICOAGULANTS*']
AD = Data[Data.drug_class == '*ANTIDIABETICS*']
Data = AD.copy()
#Drop post-index measures, identifiers, and the raw PDC value in one pass
Data = Data.drop(['num_ip_post', 'total_los_post', 'num_op_post', 'num_er_post',
                  'num_ndc_post', 'num_gpi6_post', 'adjust_total_30d_post',
                  'generic_rate_post', 'post_ip_flag', 'post_er_flag', 'pdc_cat',
                  'post_ip_cost', 'post_er_cost', 'post_rx_cost', 'post_op_cost',
                  'post_medical_cost', 'drug_class', 'patient_key', 'pdc',
                  'post_total_cost', 'generic_cost_post', 'brand_cost_post',
                  'ratio_G_total_cost_post'], axis=1)
#One-hot encode the categorical codes; the assigned column names must match
#the sorted order of the underlying numeric codes
Data[['Age_18_44', 'Age_45_64', 'Age_65_Up']] = pd.get_dummies(Data.age_grpN)
Data[['Male', 'Female']] = pd.get_dummies(Data.sexN)
Data[['Reg_NE', 'Reg_MW', 'Reg_S', 'Reg_W']] = pd.get_dummies(Data.regionN)
#Drop repeating columns now that we have their dummy values
Data = Data.drop('age_grpN', axis=1)
Data = Data.drop('sexN', axis=1)
Data = Data.drop('regionN', axis=1)
Data = Data.dropna()
from sklearn.preprocessing import StandardScaler
#Standardize the Matrix values to prepare for feature processing
sc = StandardScaler()
tst = sc.fit_transform(Data[['Male']])  #StandardScaler expects a 2-D array, so pass a one-column frame
tst
Standardizing a binary indicator really messes things up (the 0/1 values lose their meaning), so we will not be standardizing any discrete values.
#Standardize only the continuous columns (StandardScaler scales each column independently)
cont_cols = ['idx_copay', 'log_idx_copay', 'pre_ip_cost', 'pre_er_cost', 'pre_rx_cost',
             'pre_op_cost', 'pre_total_cost', 'pre_medical_cost', 'adjust_total_30d',
             'generic_rate', 'log_pre_ip_cost', 'log_pre_er_cost', 'log_pre_op_cost',
             'log_pre_rx_cost', 'generic_cost', 'brand_cost', 'ratio_G_total_cost']
Data[cont_cols] = sc.fit_transform(Data[cont_cols])
pd.value_counts(Data['Reg_W'])
#Concatenate the columns side by side (the list-of-Series constructor would build a transposed frame)
pairs = pd.concat([Data.idx_copay, Data.log_idx_copay, Data.pre_ip_cost, Data.pre_er_cost, Data.pdc_80_flag], axis=1)
%matplotlib inline
sns.set_context('poster')
sns.mpl.rc("figure", figsize=(12,6))
sns.pairplot(Data, size=3, vars=["idx_copay", "log_idx_copay", "pre_ip_cost", "pre_er_cost", "pre_rx_cost",
"pre_op_cost", "pre_medical_cost", "pre_total_cost", "adjust_total_30d",
"generic_rate", "log_pre_ip_cost", "log_pre_er_cost", 'log_pre_op_cost',
'log_pre_rx_cost', "generic_cost", "brand_cost", "ratio_G_total_cost"],
hue='pdc_80_flag');
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 20)
print('Total features:', Data.shape[1])
print('Total observations:', Data.shape[0])
Data.head(3)
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
We will be working with Chi-Square Tests, T-Tests, Wilcoxon Rank-Sum Tests, F-Tests, ANOVA Tests, and Stepwise Regression Techniques to determine which features are most important. Each test is suited to different types of data and provides different insight into the relationships between features and into the meaning of each p-value. We will also be working with Logistic Regression, Linear Regression, and Residual Analysis. We will define each test and model below before proceeding further.
An approach for modeling the relationship between a scalar dependent variable y and one or more independent variables X. In this case we will be working with multiple linear regression because we have many correlated features that we want to include in our model of y. The relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Linear regression focuses on the conditional probability distribution of y given X, rather than the joint probability distribution of y and X. One of the main goals of linear regression is prediction of the dependent variable, but it can also be used to assess the relationship between y and Xj, to assess which Xj may have no relationship with y at all, and to identify which subsets of the Xj contain redundant information about y.
Conditions:
The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features); as such, it is not in itself a classification method. Logistic regression can be seen as a special case of the generalized linear model and is thus analogous to linear regression, but is based on different assumptions. First, the conditional distribution of y given x is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to (0, 1) through the logistic distribution function, because logistic regression predicts the probability of particular outcomes.
Conditions:
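A minimal sketch on synthetic data (a Bernoulli response whose log-odds are linear in the first feature): the fitted model's predicted values are probabilities, so they stay strictly inside (0, 1):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(2)
X = rng.normal(size=(300, 2))
#True success probability follows the logistic function of a linear predictor
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * X[:, 0])))
y = rng.binomial(1, p_true)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]
print(proba.min(), proba.max())  # strictly between 0 and 1
```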
Using residual plots, you can assess whether the observed error (residuals) is consistent with stochastic error. In regression models, you shouldn't be able to predict the error for any given observation, and for a series of observations you can determine whether the residuals are consistent with random error.
A chi-square (χ²) statistic is used to investigate whether distributions of categorical variables differ from one another. The chi-square statistic compares the tallies or counts of categorical responses between two (or more) independent groups. (Note: chi-square tests can only be used on actual counts, not on percentages, proportions, means, etc.)
Conditions:
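A minimal chi-square sketch on a hypothetical 2x2 table of counts (counts, not percentages), in the same form used on the real contingency tables below:

```python
import numpy as np
from scipy.stats import chi2_contingency

#Hypothetical counts: rows = condition absent/present, columns = flag 0/1
table = np.array([[90, 60],
                  [30, 70]])
chi2, p, dof, expected = chi2_contingency(table)
print("p-Value = ", round(p, 5))
```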
Student's t-test for independent samples is used to determine whether two samples were drawn from populations with different means. Using a one-tailed p-value assumes you already know, before you even see the values, which group should be larger and which should be smaller. Since this is usually not true, we almost always use a two-tailed test. If you have more than two groups, you shouldn't use multiple t-tests because the error adds up (see familywise error), increasing your chances of finding an effect when there really isn't one (a type I error). Therefore, when you have more than two groups to compare, e.g. in a drug trial with a high-dose, low-dose, and placebo group (3 groups), you use ANOVA to examine whether there are any differences between the groups.
Conditions:
Interpreting the p-value: The null hypothesis states that the two samples come from populations with the same mean. If the p-value falls below the significance level, we reject the null hypothesis and conclude that the samples come from two different populations.
Conditions:
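A two-sample sketch on synthetic data with genuinely different means, in the same form applied to the PDC groups below:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.RandomState(4)
a = rng.normal(loc=0.0, scale=1.0, size=100)
b = rng.normal(loc=0.8, scale=1.0, size=100)

tstat, p = ttest_ind(a, b, equal_var=True)
print("p-Value, t-Test =", p)  # small: reject the null of equal means
```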
When the requirements for the t-test for two independent samples are not satisfied, the Wilcoxon Rank-Sum non-parametric test can often be used provided the two independent samples are drawn from populations with an ordinal distribution. It studies whether or not different samples come from the same population. If one observation is made at random from each population (call them x0 and y0), then the probability that x0 > y0 is the same as the probability that x0 < y0, and so the populations for each sample have the same medians. Since it compares rank sums, the Wilcoxon Rank-Sum test is more robust than the t-test as it is less likely to indicate spurious results based on the presence of outliers. Even for large samples where the assumptions for the t-test are met, the Wilcoxon Rank-Sum test is only a little less efficient than the t-test.
Interpreting the p-value: The null hypothesis states that the observations come from the same population. If our p-value falls below a specified alpha level, we reject the null hypothesis and assume that the samples come from two different populations.
Conditions:
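The robustness claim can be sketched on synthetic data: a single wild outlier inflates the t-test p-value while the rank-based test still detects the shift:

```python
import numpy as np
from scipy.stats import ranksums, ttest_ind

rng = np.random.RandomState(5)
a = rng.normal(loc=0.0, size=50)
b = rng.normal(loc=1.0, size=50)
a_out = np.append(a, 500.0)  # one wild observation contaminates group a

p_t = ttest_ind(a_out, b, equal_var=True).pvalue
p_w = ranksums(a_out, b).pvalue
print("p-Value, t-Test =", p_t)                  # washed out by the outlier
print("p-Value, Wilcoxon Rank-Sum Test =", p_w)  # still detects the shift
```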
The F-distribution is formed by the ratio of two independent chi-square variables divided by their respective degrees of freedom. Since F is formed from chi-square variables, many chi-square properties carry over to the F-distribution. The F-test is designed to test whether two population variances are equal by comparing their ratio; if the variances are equal, the ratio will be 1. The F-distribution is commonly used in ANOVA and, being a ratio of chi-square distributions, is right-skewed: it has a minimum of 0 but no maximum value, and its peak is not far from 0. The F-test is the ratio of systematic variance to unsystematic variance, so higher scores are better.
Interpreting the p-value: The null hypothesis states that the two sample variances are equal. If our p-value falls below the assigned alpha level, we reject the null hypothesis and assume that our sample variances are not equal.
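A sketch of the variance-ratio F-test on synthetic data. scipy has no canned two-sample F-test, so we refer the ratio of sample variances to the F-distribution directly:

```python
import numpy as np
from scipy.stats import f

rng = np.random.RandomState(6)
a = rng.normal(scale=1.0, size=80)
b = rng.normal(scale=2.0, size=80)

#Ratio of sample variances, larger on top, referred to F(n_b - 1, n_a - 1)
F = np.var(b, ddof=1) / np.var(a, ddof=1)
p = 2 * f.sf(F, len(b) - 1, len(a) - 1)  # two-sided
print("p-Value = ", p)
```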
Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences among group means and their associated procedures (such as "variation" among and between groups). ANOVA can be used in cases where there are more than two groups. The t-test can only be used to test differences between two means; running multiple t-tests when there are more than two means leads to complications. In such circumstances we use ANOVA, which provides a procedure for testing hypotheses concerning means when there are several populations.
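A one-way ANOVA sketch for the three-group drug-trial example above, on synthetic data:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.RandomState(7)
high = rng.normal(loc=2.0, size=40)
low = rng.normal(loc=1.0, size=40)
placebo = rng.normal(loc=0.0, size=40)

fstat, p = f_oneway(high, low, placebo)
print("p-Value = ", p)  # small: at least one group mean differs
```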
Stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure. Usually, this takes the form of a sequence of F-tests or t-tests. Properly used, stepwise regression puts more power and information at your fingertips than does the ordinary multiple regression option, and it is especially useful for sifting through large numbers of potential independent variables and/or fine-tuning a model by poking variables in or out. Improperly used, it may converge on a poor model while giving you a false sense of security.
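A minimal forward-selection sketch on synthetic data; the column indices here are illustrative, not from the project extract. Each round greedily adds the feature that most reduces cross-validated MSE and stops when nothing helps:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(8)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=300)

def cv_mse(cols):
    #5-fold cross-validated mean squared error for a column subset
    scores = cross_val_score(LinearRegression(), X[:, cols], y,
                             scoring='neg_mean_squared_error', cv=5)
    return -scores.mean()

selected, remaining, best = [], list(range(X.shape[1])), np.inf
while remaining:
    trial = {j: cv_mse(selected + [j]) for j in remaining}
    j_best = min(trial, key=trial.get)
    if trial[j_best] >= best:  # stop when no candidate improves the CV error
        break
    best = trial[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print(selected)  # the informative columns (0 and 2) should come first
```

Stopping on the cross-validated error rather than in-sample fit is one guard against the false sense of security mentioned above.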
Data.insert(0, 'pdc_flag', Data.pdc_80_flag)
Data = Data.drop('pdc_80_flag', axis=1)
Data.head(3)
PDC_0 = Data[Data.pdc_flag == 0]
PDC_1 = Data[Data.pdc_flag == 1]
from scipy.stats import chisquare
from scipy.stats import chi2_contingency
from sklearn.linear_model import Lasso, LassoCV, LassoLarsCV, LassoLarsIC, MultiTaskLasso
#clf = linear_model.Lasso(alpha=0.1)
X_Frame = Data.iloc[:, 1:].values
y_Frame = Data.iloc[:, 0].values
lasso = Lasso(alpha=0.1)
lasso.fit(X_Frame, y_Frame)
pd.set_option('display.max_rows', 20)
#Plotting Copay to check for normality
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)
import statsmodels.api as sm
sm.qqplot(Data.idx_copay, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Copay', size=12, weight='bold')
sns.distplot(Data.idx_copay, bins=20)
sns.rugplot(Data.idx_copay, vertical=False, alpha= 0.04, color='red')
ax[1].set_title('Distribution Plot of Copay', size=12, weight='bold')
plt.show();
from scipy.stats import ttest_ind
from scipy.stats import ranksums
# T-Test:
tstat, idx_copay_p_value = ttest_ind(PDC_0.idx_copay, PDC_1.idx_copay, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, idx_copay_p_valueZ = ranksums(PDC_0.idx_copay, PDC_1.idx_copay)
print("p-Value, t-Test =", idx_copay_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", idx_copay_p_valueZ)
Data.pivot_table(['pdc_flag'], index=[Data.idx_prodtypeN, Data.pdc_flag], aggfunc=len).unstack()
from scipy.stats import chisquare
from scipy.stats import chi2_contingency
chi2, idx_prodtypeN_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.idx_prodtypeN, Data.pdc_flag],
aggfunc=len).unstack())
idx_prodtypeN_p_value = round(idx_prodtypeN_p_value, 5)
print("p-Value = ", idx_prodtypeN_p_value)
Data.pivot_table(['pdc_flag'], index=[Data.idx_paytypN, Data.pdc_flag], aggfunc=len).unstack()
chi2, idx_paytypN_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.idx_paytypN, Data.pdc_flag],
aggfunc=len).unstack())
idx_paytypN_p_value = round(idx_paytypN_p_value, 5)
print("p-Value = ", idx_paytypN_p_value)
chi2, age_cat_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.age_cat, Data.pdc_flag],
aggfunc=len).unstack())
age_cat_p_value = round(age_cat_p_value, 5)
print("p-Value = ", age_cat_p_value)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.log_idx_copay, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Log Copay', size=12, weight='bold')
sns.distplot(Data.log_idx_copay, bins=20)
sns.rugplot(Data.log_idx_copay, vertical=False, alpha= 0.04, color='red')
ax[1].set_title('Distribution Plot of Log Copay', size=12, weight='bold')
plt.show();
# T-Test:
tstat, log_idx_copay_p_value = ttest_ind(PDC_0.log_idx_copay, PDC_1.log_idx_copay, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, log_idx_copay_p_valueZ = ranksums(PDC_0.log_idx_copay, PDC_1.log_idx_copay)
print("p-Value, t-Test =", log_idx_copay_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", log_idx_copay_p_valueZ)
#PDC Relationship
chi2, ALCOHOL_DRUG_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.ALCOHOL_DRUG, Data.pdc_flag],
aggfunc=len).unstack())
ALCOHOL_DRUG_p_value = round(ALCOHOL_DRUG_p_value, 5)
print("p-Value = ", ALCOHOL_DRUG_p_value)
chi2, ASTHMA_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.ASTHMA, Data.pdc_flag],
aggfunc=len).unstack())
ASTHMA_p_value = round(ASTHMA_p_value, 5)
print("p-Value = ", ASTHMA_p_value)
chi2, CARDIAC_ARRYTHMIA_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.CARDIAC_ARRYTHMIA, Data.pdc_flag],
aggfunc=len).unstack())
CARDIAC_ARRYTHMIA_p_value = round(CARDIAC_ARRYTHMIA_p_value, 5)
print("p-Value = ", CARDIAC_ARRYTHMIA_p_value)
chi2, CARDIAC_VALVULAR_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.CARDIAC_VALVULAR, Data.pdc_flag],
aggfunc=len).unstack())
CARDIAC_VALVULAR_p_value = round(CARDIAC_VALVULAR_p_value, 5)
print("p-Value = ", CARDIAC_VALVULAR_p_value)
chi2, CEREBROVASCULAR_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.CEREBROVASCULAR, Data.pdc_flag],
aggfunc=len).unstack())
CEREBROVASCULAR_p_value = round(CEREBROVASCULAR_p_value, 5)
print("p-Value = ", CEREBROVASCULAR_p_value)
chi2, CHRONIC_KIDNEY_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.CHRONIC_KIDNEY, Data.pdc_flag],
aggfunc=len).unstack())
CHRONIC_KIDNEY_p_value = round(CHRONIC_KIDNEY_p_value, 5)
print("p-Value = ", CHRONIC_KIDNEY_p_value)
chi2, CHRONIC_PAIN_FIBRO_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.CHRONIC_PAIN_FIBRO, Data.pdc_flag],
aggfunc=len).unstack())
CHRONIC_PAIN_FIBRO_p_value = round(CHRONIC_PAIN_FIBRO_p_value, 5)
print("p-Value = ", CHRONIC_PAIN_FIBRO_p_value)
chi2, CHF_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.CHF, Data.pdc_flag],
aggfunc=len).unstack())
CHF_p_value = round(CHF_p_value, 5)
print("p-Value = ", CHF_p_value)
chi2, COPD_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.COPD, Data.pdc_flag],
aggfunc=len).unstack())
COPD_p_value = round(COPD_p_value, 5)
print("p-Value = ", COPD_p_value)
chi2, DEMENTIA_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.DEMENTIA, Data.pdc_flag],
aggfunc=len).unstack())
DEMENTIA_p_value = round(DEMENTIA_p_value, 5)
print("p-Value = ", DEMENTIA_p_value)
chi2, DEPRESSION_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.DEPRESSION, Data.pdc_flag],
aggfunc=len).unstack())
DEPRESSION_p_value = round(DEPRESSION_p_value, 5)
print("p-Value = ", DEPRESSION_p_value)
chi2, DIABETES_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.DIABETES, Data.pdc_flag],
aggfunc=len).unstack())
DIABETES_p_value = round(DIABETES_p_value, 5)
print("p-Value = ", DIABETES_p_value)
#Well that makes sense, huh?
chi2, DYSLIPIDEMIA_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.DYSLIPIDEMIA, Data.pdc_flag],
aggfunc=len).unstack())
DYSLIPIDEMIA_p_value = round(DYSLIPIDEMIA_p_value, 5)
print("p-Value = ", DYSLIPIDEMIA_p_value)
chi2, EPILEPSY_SEIZURE_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.EPILEPSY_SEIZURE, Data.pdc_flag],
aggfunc=len).unstack())
EPILEPSY_SEIZURE_p_value = round(EPILEPSY_SEIZURE_p_value, 5)
print("p-Value = ", EPILEPSY_SEIZURE_p_value)
chi2, HEPATITIS_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.HEPATITIS, Data.pdc_flag],
aggfunc=len).unstack())
HEPATITIS_p_value = round(HEPATITIS_p_value, 5)
print("p-Value = ", HEPATITIS_p_value)
chi2, HIV_AIDS_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.HIV_AIDS, Data.pdc_flag],
aggfunc=len).unstack())
HIV_AIDS_p_value = round(HIV_AIDS_p_value, 5)
print("p-Value = ", HIV_AIDS_p_value)
chi2, HYPERTENSION_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.HYPERTENSION, Data.pdc_flag],
aggfunc=len).unstack())
HYPERTENSION_p_value = round(HYPERTENSION_p_value, 5)
print("p-Value = ", HYPERTENSION_p_value)
chi2, LIVER_GALLBLADDER_PANCREAS_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.LIVER_GALLBLADDER_PANCREAS, Data.pdc_flag],
aggfunc=len).unstack())
LIVER_GALLBLADDER_PANCREAS_p_value = round(LIVER_GALLBLADDER_PANCREAS_p_value, 5)
print("p-Value = ", LIVER_GALLBLADDER_PANCREAS_p_value)
chi2, MI_CAD_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.MI_CAD, Data.pdc_flag],
aggfunc=len).unstack())
MI_CAD_p_value = round(MI_CAD_p_value, 5)
print("p-Value = ", MI_CAD_p_value)
chi2, OSTEOARTHRITIS_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.OSTEOARTHRITIS, Data.pdc_flag],
aggfunc=len).unstack())
OSTEOARTHRITIS_p_value = round(OSTEOARTHRITIS_p_value, 5)
print("p-Value = ", OSTEOARTHRITIS_p_value)
chi2, PARALYSIS_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.PARALYSIS, Data.pdc_flag],
aggfunc=len).unstack())
PARALYSIS_p_value = round(PARALYSIS_p_value, 5)
print("p-Value = ", PARALYSIS_p_value)
chi2, PEPTIC_ULCER_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.PEPTIC_ULCER, Data.pdc_flag],
aggfunc=len).unstack())
PEPTIC_ULCER_p_value = round(PEPTIC_ULCER_p_value, 5)
print("p-Value = ", PEPTIC_ULCER_p_value)
chi2, PERIPHERAL_VASCULAR_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.PERIPHERAL_VASCULAR, Data.pdc_flag],
aggfunc=len).unstack())
PERIPHERAL_VASCULAR_p_value = round(PERIPHERAL_VASCULAR_p_value, 5)
print("p-Value = ", PERIPHERAL_VASCULAR_p_value)
chi2, RENAL_FAILURE_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.RENAL_FAILURE, Data.pdc_flag],
aggfunc=len).unstack())
RENAL_FAILURE_p_value = round(RENAL_FAILURE_p_value, 5)
print("p-Value = ", RENAL_FAILURE_p_value)
chi2, RHEUMATOLOGIC_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.RHEUMATOLOGIC, Data.pdc_flag],
aggfunc=len).unstack())
RHEUMATOLOGIC_p_value = round(RHEUMATOLOGIC_p_value, 5)
print("p-Value = ", RHEUMATOLOGIC_p_value)
chi2, SCHIZOPHRENIA_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.SCHIZOPHRENIA, Data.pdc_flag],
aggfunc=len).unstack())
SCHIZOPHRENIA_p_value = round(SCHIZOPHRENIA_p_value, 5)
print("p-Value = ", SCHIZOPHRENIA_p_value)
chi2, SLEEP_DISORDERS_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.SLEEP_DISORDERS, Data.pdc_flag],
aggfunc=len).unstack())
SLEEP_DISORDERS_p_value = round(SLEEP_DISORDERS_p_value, 5)
print("p-Value = ", SLEEP_DISORDERS_p_value)
chi2, SMOKING_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.SMOKING, Data.pdc_flag],
aggfunc=len).unstack())
SMOKING_p_value = round(SMOKING_p_value, 5)
print("p-Value = ", SMOKING_p_value)
chi2, THYROID_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.THYROID, Data.pdc_flag],
aggfunc=len).unstack())
THYROID_p_value = round(THYROID_p_value, 5)
print("p-Value = ", THYROID_p_value)
chi2, Solid_Tumor_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.Solid_Tumor, Data.pdc_flag],
aggfunc=len).unstack())
Solid_Tumor_p_value = round(Solid_Tumor_p_value, 5)
print("p-Value = ", Solid_Tumor_p_value)
chi2, Metastatic_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.Metastatic, Data.pdc_flag],
aggfunc=len).unstack())
Metastatic_p_value = round(Metastatic_p_value, 5)
print("p-Value = ", Metastatic_p_value)
chi2, Leukemia_Lymphoma_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.Leukemia_Lymphoma, Data.pdc_flag],
aggfunc=len).unstack())
Leukemia_Lymphoma_p_value = round(Leukemia_Lymphoma_p_value, 5)
print("p-Value = ", Leukemia_Lymphoma_p_value)
chi2, Other_Cancer_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.Other_Cancer, Data.pdc_flag],
aggfunc=len).unstack())
Other_Cancer_p_value = round(Other_Cancer_p_value, 5)
print("p-Value = ", Other_Cancer_p_value)
chi2, Cancer_In_Situ_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.Cancer_In_Situ, Data.pdc_flag],
aggfunc=len).unstack())
Cancer_In_Situ_p_value = round(Cancer_In_Situ_p_value, 5)
print("p-Value = ", Cancer_In_Situ_p_value)
chi2, pre_CCI_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.pre_CCI, Data.pdc_flag],
aggfunc=len).unstack())
pre_CCI_p_value = round(pre_CCI_p_value, 5)
print("p-Value = ", pre_CCI_p_value)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.pre_ip_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Inpatient Costs', size=12, weight='bold')
sns.distplot(Data.pre_ip_cost, bins=20)
sns.rugplot(Data.pre_ip_cost, vertical=False, alpha= 0.04, color='red')
ax[1].set_title('Distribution Plot of Inpatient Costs', size=12, weight='bold')
plt.show();
# T-Test:
tstat, pre_ip_cost_p_value = ttest_ind(PDC_0.pre_ip_cost, PDC_1.pre_ip_cost, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, pre_ip_cost_p_valueZ = ranksums(PDC_0.pre_ip_cost, PDC_1.pre_ip_cost)
print("p-Value, t-Test =", pre_ip_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", pre_ip_cost_p_valueZ)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.pre_er_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Emergency Costs', size=12, weight='bold')
sns.distplot(Data.pre_er_cost, bins=20)
sns.rugplot(Data.pre_er_cost, vertical=False, alpha= 0.04, color='red')
ax[1].set_title('Distribution Plot of Emergency Costs', size=12, weight='bold')
plt.show();
# T-Test:
tstat, pre_er_cost_p_value = ttest_ind(PDC_0.pre_er_cost, PDC_1.pre_er_cost, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, pre_er_cost_p_valueZ = ranksums(PDC_0.pre_er_cost, PDC_1.pre_er_cost)
print("p-Value, t-Test =", pre_er_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", pre_er_cost_p_valueZ)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.pre_rx_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Pharmacy Costs', size=12, weight='bold')
sns.distplot(Data.pre_rx_cost, bins=20)
sns.rugplot(Data.pre_rx_cost, vertical=False, alpha= 0.04, color='red')
ax[1].set_title('Distribution Plot of Pharmacy Costs', size=12, weight='bold')
plt.show();
# T-Test:
tstat, pre_rx_cost_p_value = ttest_ind(PDC_0.pre_rx_cost, PDC_1.pre_rx_cost, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, pre_rx_cost_p_valueZ = ranksums(PDC_0.pre_rx_cost, PDC_1.pre_rx_cost)
print("p-Value, t-Test =", pre_rx_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", pre_rx_cost_p_valueZ)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.pre_op_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Outpatient Costs', size=12, weight='bold')
sns.distplot(Data.pre_op_cost, bins=20)
sns.rugplot(Data.pre_op_cost, vertical=False, alpha= 0.04, color='red')
ax[1].set_title('Distribution Plot of Outpatient Costs', size=12, weight='bold')
plt.show();
# T-Test:
tstat, pre_op_cost_p_value = ttest_ind(PDC_0.pre_op_cost, PDC_1.pre_op_cost, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, pre_op_cost_p_valueZ = ranksums(PDC_0.pre_op_cost, PDC_1.pre_op_cost)
print("p-Value, t-Test =", pre_op_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", pre_op_cost_p_valueZ)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.pre_total_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Total Healthcare Costs', size=12, weight='bold')
sns.distplot(Data.pre_total_cost, bins=20)
sns.rugplot(Data.pre_total_cost, vertical=False, alpha= 0.04, color='red')
ax[1].set_title('Distribution Plot of Total Healthcare Costs', size=12, weight='bold')
plt.show();
# T-Test:
tstat, pre_total_cost_p_value = ttest_ind(PDC_0.pre_total_cost, PDC_1.pre_total_cost, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, pre_total_cost_p_valueZ = ranksums(PDC_0.pre_total_cost, PDC_1.pre_total_cost)
print("p-Value, t-Test =", pre_total_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", pre_total_cost_p_valueZ)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.pre_medical_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Total Medical Costs', size=12, weight='bold')
sns.distplot(Data.pre_medical_cost, bins=20)
sns.rugplot(Data.pre_medical_cost, vertical=False, alpha= 0.04, color='red')
ax[1].set_title('Distribution Plot of Total Medical Costs', size=12, weight='bold')
plt.show();
# T-Test:
tstat, pre_medical_cost_p_value = ttest_ind(PDC_0.pre_medical_cost, PDC_1.pre_medical_cost, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, pre_medical_cost_p_valueZ = ranksums(PDC_0.pre_medical_cost, PDC_1.pre_medical_cost)
print("p-Value, t-Test =", pre_medical_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", pre_medical_cost_p_valueZ)
chi2, num_ip_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.num_ip, Data.pdc_flag],
aggfunc=len).unstack())
num_ip_p_value = round(num_ip_p_value, 5)
print("p-Value = ", num_ip_p_value)
chi2, total_los_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.total_los, Data.pdc_flag],
aggfunc=len).unstack())
total_los_p_value = round(total_los_p_value, 5)
print("p-Value = ", total_los_p_value)
chi2, num_op_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.num_op, Data.pdc_flag],
aggfunc=len).unstack())
num_op_p_value = round(num_op_p_value, 5)
print("p-Value = ", num_op_p_value)
chi2, num_er_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.num_er, Data.pdc_flag],
aggfunc=len).unstack())
num_er_p_value = round(num_er_p_value, 5)
print("p-Value = ", num_er_p_value)
chi2, num_ndc_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.num_ndc, Data.pdc_flag],
aggfunc=len).unstack())
num_ndc_p_value = round(num_ndc_p_value, 5)
print("p-Value = ", num_ndc_p_value)
chi2, num_gpi6_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.num_gpi6, Data.pdc_flag],
aggfunc=len).unstack())
num_gpi6_p_value = round(num_gpi6_p_value, 5)
print("p-Value = ", num_gpi6_p_value)
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.generic_rate, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Generic Rate', size=12, weight='bold')
sns.distplot(Data.generic_rate, bins=20)
sns.rugplot(Data.generic_rate, vertical=False, alpha= 0.04, color='red')
ax[1].set_title('Distribution Plot of Generic Rate', size=12, weight='bold')
plt.show();
# T-Test:
tstat, generic_rate_p_value = ttest_ind(PDC_0.generic_rate, PDC_1.generic_rate, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, generic_rate_p_valueZ = ranksums(PDC_0.generic_rate, PDC_1.generic_rate)
print("p-Value, t-Test =", generic_rate_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", generic_rate_p_valueZ)
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.adjust_total_30d, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Adjusted Fill Rate', size=12, weight='bold')
sns.distplot(Data.adjust_total_30d, bins=20)
sns.rugplot(Data.adjust_total_30d, vertical=False, alpha= 0.04, color='red')
ax[1].set_title('Distribution Plot of Adjusted Fill Rate', size=12, weight='bold')
plt.show();
# T-Test:
tstat, adjust_total_30d_p_value = ttest_ind(PDC_0.adjust_total_30d, PDC_1.adjust_total_30d, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, adjust_total_30d_p_valueZ = ranksums(PDC_0.adjust_total_30d, PDC_1.adjust_total_30d)
print("p-Value, t-Test =", adjust_total_30d_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", adjust_total_30d_p_valueZ)
chi2, pre_ip_flag_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.pre_ip_flag, Data.pdc_flag],
aggfunc=len).unstack())
pre_ip_flag_p_value = round(pre_ip_flag_p_value, 5)
print("p-Value = ", pre_ip_flag_p_value)
chi2, pre_er_flag_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.pre_er_flag, Data.pdc_flag],
aggfunc=len).unstack())
pre_er_flag_p_value = round(pre_er_flag_p_value, 5)
print("p-Value = ", pre_er_flag_p_value)
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.log_pre_ip_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Log Inpatient Costs', size=12, weight='bold')
sns.distplot(Data.log_pre_ip_cost, bins=20)
sns.rugplot(Data.log_pre_ip_cost, vertical=False, alpha= 0.04, color='red')
ax[1].set_title('Distribution Plot of Log Inpatient Costs', size=12, weight='bold')
plt.show();
# T-Test:
tstat, log_pre_ip_cost_p_value = ttest_ind(PDC_0.log_pre_ip_cost, PDC_1.log_pre_ip_cost, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, log_pre_ip_cost_p_valueZ = ranksums(PDC_0.log_pre_ip_cost, PDC_1.log_pre_ip_cost)
print("p-Value, t-Test =", log_pre_ip_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", log_pre_ip_cost_p_valueZ)
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.log_pre_er_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Log ER Costs', size=12, weight='bold')
sns.distplot(Data.log_pre_er_cost, bins=20, ax=ax[1])
sns.rugplot(Data.log_pre_er_cost, vertical=False, alpha=0.04, color='red', ax=ax[1])
ax[1].set_title('Distribution Plot of Log ER Costs', size=12, weight='bold')
plt.show();
# T-Test:
tstat, log_pre_er_cost_p_value = ttest_ind(PDC_0.log_pre_er_cost, PDC_1.log_pre_er_cost, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, log_pre_er_cost_p_valueZ = ranksums(PDC_0.log_pre_er_cost, PDC_1.log_pre_er_cost)
print("p-Value, t-Test =", log_pre_er_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", log_pre_er_cost_p_valueZ)
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.log_pre_op_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Log Outpatient Costs', size=12, weight='bold')
sns.distplot(Data.log_pre_op_cost, bins=20, ax=ax[1])
sns.rugplot(Data.log_pre_op_cost, vertical=False, alpha=0.04, color='red', ax=ax[1])
ax[1].set_title('Distribution Plot of Log Outpatient Costs', size=12, weight='bold')
plt.show();
# T-Test:
tstat, log_pre_op_cost_p_value = ttest_ind(PDC_0.log_pre_op_cost, PDC_1.log_pre_op_cost, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, log_pre_op_cost_p_valueZ = ranksums(PDC_0.log_pre_op_cost, PDC_1.log_pre_op_cost)
print("p-Value, t-Test =", log_pre_op_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", log_pre_op_cost_p_valueZ)
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.log_pre_rx_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Log Pharmacy Costs', size=12, weight='bold')
sns.distplot(Data.log_pre_rx_cost, bins=20, ax=ax[1])
sns.rugplot(Data.log_pre_rx_cost, vertical=False, alpha=0.04, color='red', ax=ax[1])
ax[1].set_title('Distribution Plot of Log Pharmacy Costs', size=12, weight='bold')
plt.show();
# T-Test:
tstat, log_pre_rx_cost_p_value = ttest_ind(PDC_0.log_pre_rx_cost, PDC_1.log_pre_rx_cost, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, log_pre_rx_cost_p_valueZ = ranksums(PDC_0.log_pre_rx_cost, PDC_1.log_pre_rx_cost)
print("p-Value, t-Test =", log_pre_rx_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", log_pre_rx_cost_p_valueZ)
chi2, pre_total_cat_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.pre_total_cat, Data.pdc_flag],
aggfunc=len).unstack())
pre_total_cat_p_value = round(pre_total_cat_p_value, 5)
print("p-Value = ", pre_total_cat_p_value)
chi2, numofgen_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.numofgen, Data.pdc_flag],
aggfunc=len).unstack())
numofgen_p_value = round(numofgen_p_value, 5)
print("p-Value = ", numofgen_p_value)
chi2, numofbrand_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.numofbrand, Data.pdc_flag],
aggfunc=len).unstack())
numofbrand_p_value = round(numofbrand_p_value, 5)
print("p-Value = ", numofbrand_p_value)
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.generic_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Generic Scripts Costs', size=12, weight='bold')
sns.distplot(Data.generic_cost, bins=20, ax=ax[1])
sns.rugplot(Data.generic_cost, vertical=False, alpha=0.04, color='red', ax=ax[1])
ax[1].set_title('Distribution Plot of Generic Scripts Costs', size=12, weight='bold')
plt.show();
# T-Test:
tstat, generic_cost_p_value = ttest_ind(PDC_0.generic_cost, PDC_1.generic_cost, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, generic_cost_p_valueZ = ranksums(PDC_0.generic_cost, PDC_1.generic_cost)
print("p-Value, t-Test =", generic_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", generic_cost_p_valueZ)
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.brand_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Brand Scripts Costs', size=12, weight='bold')
sns.distplot(Data.brand_cost, bins=20, ax=ax[1])
sns.rugplot(Data.brand_cost, vertical=False, alpha=0.04, color='red', ax=ax[1])
ax[1].set_title('Distribution Plot of Brand Scripts Costs', size=12, weight='bold')
plt.show();
# T-Test:
tstat, brand_cost_p_value = ttest_ind(PDC_0.brand_cost, PDC_1.brand_cost, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, brand_cost_p_valueZ = ranksums(PDC_0.brand_cost, PDC_1.brand_cost)
print("p-Value, t-Test =", brand_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", brand_cost_p_valueZ)
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)
sm.qqplot(Data.ratio_G_total_cost_post, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Generic/Pharmacy Ratio Costs', size=12, weight='bold')
sns.distplot(Data.ratio_G_total_cost_post, bins=20, ax=ax[1])
sns.rugplot(Data.ratio_G_total_cost_post, vertical=False, alpha=0.04, color='red', ax=ax[1])
ax[1].set_title('Distribution Plot of Generic/Pharmacy Ratio Costs', size=12, weight='bold')
plt.show();
# T-Test:
tstat, ratio_G_total_cost_post_p_value = ttest_ind(PDC_0.ratio_G_total_cost_post,
PDC_1.ratio_G_total_cost_post, equal_var=True)
# Wilcoxon Rank-Sum Test:
zstat, ratio_G_total_cost_post_p_valueZ = ranksums(PDC_0.ratio_G_total_cost_post, PDC_1.ratio_G_total_cost_post)
print("p-Value, t-Test =", ratio_G_total_cost_post_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", ratio_G_total_cost_post_p_valueZ)
chi2, Age_18_44_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.Age_18_44, Data.pdc_flag],
aggfunc=len).unstack())
Age_18_44_p_value = round(Age_18_44_p_value, 5)
print("p-Value = ", Age_18_44_p_value)
chi2, Age_45_64_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.Age_45_64, Data.pdc_flag],
aggfunc=len).unstack())
Age_45_64_p_value = round(Age_45_64_p_value, 5)
print("p-Value = ", Age_45_64_p_value)
chi2, Age_65_Up_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.Age_65_Up, Data.pdc_flag],
aggfunc=len).unstack())
Age_65_Up_p_value = round(Age_65_Up_p_value, 5)
print("p-Value = ", Age_65_Up_p_value)
chi2, Male_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.Male, Data.pdc_flag],
aggfunc=len).unstack())
Male_p_value = round(Male_p_value, 5)
print("p-Value = ", Male_p_value)
chi2, Female_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.Female, Data.pdc_flag],
aggfunc=len).unstack())
Female_p_value = round(Female_p_value, 5)
print("p-Value = ", Female_p_value)
chi2, Reg_NE_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.Reg_NE, Data.pdc_flag],
aggfunc=len).unstack())
Reg_NE_p_value = round(Reg_NE_p_value, 5)
print("p-Value = ", Reg_NE_p_value)
chi2, Reg_MW_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.Reg_MW, Data.pdc_flag],
aggfunc=len).unstack())
Reg_MW_p_value = round(Reg_MW_p_value, 5)
print("p-Value = ", Reg_MW_p_value)
chi2, Reg_S_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.Reg_S, Data.pdc_flag],
aggfunc=len).unstack())
Reg_S_p_value = round(Reg_S_p_value, 5)
print("p-Value = ", Reg_S_p_value)
chi2, Reg_W_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'],
index=[Data.Reg_W, Data.pdc_flag],
aggfunc=len).unstack())
Reg_W_p_value = round(Reg_W_p_value, 5)
print("p-Value = ", Reg_W_p_value)
We will build several data frames to hold our features, each getting progressively smaller as the accepted significance level decreases.
# Keep only features whose test p-values fall below 0.70
Feat_Lvl_70 = Data.drop(['DEMENTIA', 'DIABETES', 'SCHIZOPHRENIA', 'pre_CCI', 'num_gpi6',
                         'total_los', 'num_op', 'numofgen', 'num_er', 'num_ndc',
                         'numofbrand', 'idx_copay', 'ALCOHOL_DRUG', 'ASTHMA',
                         'CEREBROVASCULAR', 'CHF', 'HIV_AIDS',
                         'LIVER_GALLBLADDER_PANCREAS', 'OSTEOARTHRITIS', 'PARALYSIS'], axis=1)
# Keep only features whose test p-values fall below 0.50
Feat_Lvl_50 = Feat_Lvl_70.drop(['idx_prodtypeN', 'EPILEPSY_SEIZURE', 'MI_CAD', 'PEPTIC_ULCER',
                                'RENAL_FAILURE', 'RHEUMATOLOGIC', 'log_pre_op_cost', 'Reg_W'], axis=1)
# Keep only features whose test p-values fall below 0.30
Feat_Lvl_30 = Feat_Lvl_50.drop(['CHRONIC_KIDNEY', 'CHRONIC_PAIN_FIBRO', 'COPD', 'HEPATITIS',
                                'SLEEP_DISORDERS', 'Solid_Tumor', 'Other_Cancer', 'pre_ip_cost',
                                'pre_total_cost', 'generic_rate', 'log_pre_ip_cost'], axis=1)
# Keep only features whose test p-values fall below 0.20
Feat_Lvl_20 = Feat_Lvl_30.drop(['CARDIAC_ARRYTHMIA', 'CARDIAC_VALVULAR', 'DEPRESSION',
                                'num_ip', 'pre_total_cat'], axis=1)
# Keep only features whose test p-values fall below 0.10
#Feat_Lvl_10 = Feat_Lvl_20.drop('idx_paytypN', axis=1)
Feat_Lvl_10 = Feat_Lvl_20.drop(['PERIPHERAL_VASCULAR', 'SMOKING', 'THYROID',
                                'pre_er_cost', 'pre_op_cost'], axis=1)
# Keep only features whose test p-values fall below 0.05
Feat_Lvl_05 = Feat_Lvl_10.drop(['log_idx_copay', 'pre_ip_flag', 'Reg_MW'], axis=1)
print(Feat_Lvl_05.shape[1])
Feat_Lvl_05.head(3)
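The long cascades of drop calls above can also be generated from a dictionary of per-feature p-values, so each threshold frame is derived automatically. A sketch with hypothetical feature names and p-values (the real values come from the tests above):

```python
import pandas as pd

# Hypothetical p-values per feature; in the project these come from the chi-square/t-tests
p_values = {'feat_a': 0.72, 'feat_b': 0.41, 'feat_c': 0.12, 'feat_d': 0.03}
df = pd.DataFrame(1.0, index=range(5), columns=['pdc_flag'] + list(p_values))

def features_below(frame, pvals, threshold):
    """Keep the target column plus every feature whose p-value is below the threshold."""
    keep = [c for c, p in pvals.items() if p < threshold]
    return frame[['pdc_flag'] + keep]

feat_lvl_05 = features_below(df, p_values, 0.05)
print(list(feat_lvl_05.columns))
```

This keeps the nested feature frames consistent by construction: any change to a test's p-value propagates to every threshold level.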
In our R code, the Kruskal-Wallis test gave an extremely high p-value (0.581) for idx_paytypN, so we removed it from our Feat_Lvl_10 data frame.
from sklearn.ensemble import RandomForestRegressor
X = Data.iloc[:, 1:].values
y = Data.iloc[:, 0].values
Names = Data.columns[1:]
rf = RandomForestRegressor(n_estimators=100, random_state=0)  # fixed seed so the importance ranking is reproducible
rf.fit(X, y)
print('Features sorted by their score:')
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), Names), reverse=True))
RF_Extract = Data[['adjust_total_30d', 'generic_cost', 'pre_total_cost', 'idx_copay', 'num_gpi6',
'num_ndc', 'pre_medical_cost', 'num_op', 'numofgen', 'pre_rx_cost', 'pre_op_cost', 'generic_rate',
'ratio_G_total_cost', 'age_cat', 'brand_cost', 'pre_CCI']]
Estimate = Data[['adjust_total_30d', 'generic_cost', 'pre_total_cost', 'log_idx_copay', 'num_gpi6',
'num_ndc', 'pre_medical_cost', 'num_op', 'numofgen', 'pre_rx_cost', 'pre_op_cost', 'generic_rate',
'ratio_G_total_cost', 'age_cat', 'brand_cost', 'pre_CCI', 'DYSLIPIDEMIA', 'HYPERTENSION',
'Metastatic', 'Leukemia_Lymphoma', 'Cancer_In_Situ', 'pre_er_flag', 'log_pre_er_cost', 'Age_18_44',
'Age_45_64', 'Age_65_Up', 'Male', 'Female', 'Reg_NE', 'Reg_S', 'numofbrand']]
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
#Convert dataframe values into corresponding X matrix and y vector
X, y = Estimate.iloc[:, 1:].values, Data.iloc[:, 0].values
#Convert X and y arrays into train and test arrays
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
#Standardize the Matrix values to prepare for feature processing
#sc = StandardScaler()
#X_std = sc.fit_transform(X)
#X_train_std = sc.fit_transform(X_train)
#X_test_std = sc.fit_transform(X_test)
#X_std
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='l2', C=1)
lr.fit(X_train, y_train)
Est_Accuracy = lr.score(X_test, y_test)
print('Accuracy:', lr.score(X_test, y_test))
X2, y2 = Data.iloc[:, 1:].values, Data.iloc[:, 0].values
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.2, random_state=0)
lr = LogisticRegression(penalty='l2', C=1)
lr.fit(X_train2, y_train2)
Min_Accuracy = lr.score(X_test2, y_test2)
print('Accuracy:', lr.score(X_test2, y_test2))
# Refit on the combined-features training set and generate predictions for the metrics below
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
from sklearn.metrics import precision_score, recall_score, f1_score
print('Precision: %.3f' % precision_score(y_true=y_test, y_pred=y_pred))
print('Recall: %.3f' % recall_score(y_true=y_test, y_pred=y_pred))
print('F1: %.3f' % f1_score(y_true=y_test, y_pred=y_pred))
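Precision, recall, and F1 can also be pulled in one call via `classification_report`; a toy sketch with hypothetical labels (not the project predictions):

```python
import numpy as np
from sklearn.metrics import classification_report

# Toy true labels and predictions, just to show the combined per-class report
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_hat = np.array([0, 1, 0, 0, 1, 1, 1, 1])
report = classification_report(y_true, y_hat)
print(report)
```

The per-class breakdown is useful here because the PDC flag classes may be imbalanced, in which case overall accuracy alone is misleading.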
# The same split/fit/score procedure applies to every candidate feature frame,
# so wrap it in a helper rather than repeating the boilerplate
def lr_test_accuracy(frame):
    Xf, yf = frame.iloc[:, 1:].values, Data.iloc[:, 0].values
    Xf_train, Xf_test, yf_train, yf_test = train_test_split(Xf, yf, test_size=0.2, random_state=0)
    model = LogisticRegression(penalty='l2', C=1)
    model.fit(Xf_train, yf_train)
    return model.score(Xf_test, yf_test)
Below_70_Accuracy = lr_test_accuracy(Feat_Lvl_70)
print('Accuracy (p < 0.7):', Below_70_Accuracy)
Below_50_Accuracy = lr_test_accuracy(Feat_Lvl_50)
print('Accuracy (p < 0.5):', Below_50_Accuracy)
Below_30_Accuracy = lr_test_accuracy(Feat_Lvl_30)
print('Accuracy (p < 0.3):', Below_30_Accuracy)
Below_20_Accuracy = lr_test_accuracy(Feat_Lvl_20)
print('Accuracy (p < 0.2):', Below_20_Accuracy)
Below_10_Accuracy = lr_test_accuracy(Feat_Lvl_10)
print('Accuracy (p < 0.1):', Below_10_Accuracy)
Below_05_Accuracy = lr_test_accuracy(Feat_Lvl_05)
print('Accuracy (p < 0.05):', Below_05_Accuracy)
RF_Accuracy = lr_test_accuracy(RF_Extract)
print('Accuracy (Random Forest extract):', RF_Accuracy)
arrays = [['Logistic Model'],
['Accuracy Score']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['Model Type','Data Frame Type'])
Output = pd.DataFrame(0, index=['minimal feature reduction','p-value below 0.7', 'p-value below 0.5', 'p-value below 0.3',
'p-value below 0.2', 'p-value below 0.1', 'p-value below 0.05', 'Random Forest extract',
'combined features'],
columns=index)
Output
Output.loc[('minimal feature reduction'), ('Logistic Model', 'Accuracy Score')] = Min_Accuracy
Output.loc[('p-value below 0.7'), ('Logistic Model', 'Accuracy Score')] = Below_70_Accuracy
Output.loc[('p-value below 0.5'), ('Logistic Model', 'Accuracy Score')] = Below_50_Accuracy
Output.loc[('p-value below 0.3'), ('Logistic Model', 'Accuracy Score')] = Below_30_Accuracy
Output.loc[('p-value below 0.2'), ('Logistic Model', 'Accuracy Score')] = Below_20_Accuracy
Output.loc[('p-value below 0.1'), ('Logistic Model', 'Accuracy Score')] = Below_10_Accuracy
Output.loc[('p-value below 0.05'), ('Logistic Model', 'Accuracy Score')] = Below_05_Accuracy
Output.loc[('Random Forest extract'), ('Logistic Model', 'Accuracy Score')] = RF_Accuracy
Output.loc[('combined features'), ('Logistic Model', 'Accuracy Score')] = Est_Accuracy
Output
Output.style.set_table_styles([{'selector': 'tr:hover', 'props': [('background-color', 'lavender')]},
{'selector': 'tr:nth-child(even)', 'props': [('background-color', 'lightgreen')]}])
from sklearn.metrics import confusion_matrix
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(confmat)
fig, ax = plt.subplots(figsize=(4, 4))
ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.5)
for i in range(confmat.shape[0]):
for j in range(confmat.shape[1]):
ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show();
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
pipe_lr = Pipeline([('pca', PCA(n_components=2)),
('clf', LogisticRegression(random_state=1))])
pipe_lr.fit(X_train, y_train)
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
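One caveat on the pipeline above: PCA is scale-sensitive, so dollar-valued cost features would dominate the 0/1 flags in the principal components. A sketch of a variant that standardizes before projecting, using synthetic stand-in data (`X_demo` and `y_demo` are hypothetical, not the project matrices):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
# Synthetic stand-in: one feature on a dollar-like scale, nine on a flag-like scale
X_demo = rng.randn(200, 10) * np.array([1e4] + [1] * 9)
y_demo = (X_demo[:, 1] + X_demo[:, 2] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

pipe = Pipeline([('scl', StandardScaler()),   # put all features on a common scale first
                 ('pca', PCA(n_components=5)),
                 ('clf', LogisticRegression(random_state=1))])
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)
print('Test Accuracy: %.3f' % acc)
```

Without the scaler, the first principal component of `X_demo` would be almost entirely the large-scale feature, regardless of its relevance to the target.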
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=lr, X=X_train, y=y_train, cv=10, n_jobs=1)
print('CV Accuracy Scores: %s' % scores)
print('\n')
print('CV Mean Accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
scores = cross_val_score(estimator=pipe_lr, X=X_train, y=y_train, cv=10, n_jobs=1)
print('CV Accuracy Scores: %s' % scores)
print('\n')
print('CV Mean Accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
from sklearn.model_selection import learning_curve
pipe_lr = Pipeline([
('scl', StandardScaler()),
('clf', LogisticRegression(penalty='l2', random_state=0))])
train_sizes, train_scores, test_scores = learning_curve(estimator = pipe_lr, X = X_train, y = y_train,
train_sizes = np.linspace(0.1, 1, 10), cv = 10, n_jobs = 1)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.plot(train_sizes, train_mean, color='blue', marker='o', markersize=5, label='training accuracy')
plt.fill_between(train_sizes, train_mean + train_std, train_mean - train_std, alpha=0.15, color='blue')
plt.plot(train_sizes, test_mean, color='green', linestyle='--', marker='s', markersize=5, label='validation accuracy')
plt.fill_between(train_sizes, test_mean + test_std, test_mean - test_std, alpha=0.15, color='green')
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend(loc="lower right")
plt.show()
from sklearn.model_selection import validation_curve
param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, test_scores = validation_curve(estimator = pipe_lr, X = X_train, y = y_train, param_name = 'clf__C',
param_range = param_range, cv = 10)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.plot(param_range, train_mean, color='blue', marker='o', markersize=5, label='training accuracy')
plt.fill_between(param_range, train_mean + train_std, train_mean - train_std, alpha=0.15, color='blue')
plt.plot(param_range, test_mean, linestyle='--', color='green', marker='s', markersize=5, label='validation accuracy')
plt.fill_between(param_range, test_mean + test_std, test_mean - test_std, alpha=0.15, color='green')
plt.xscale('log')
plt.legend(loc='lower right')
plt.xlabel('parameter C')
plt.ylabel('Accuracy')
plt.show()
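The validation curve reads off a good C by eye; GridSearchCV can select it by cross-validation directly. A minimal sketch on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins for the training matrices):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Synthetic binary classification problem with a linear decision rule
X_demo = rng.randn(300, 5)
y_demo = (X_demo[:, 0] - X_demo[:, 3] > 0).astype(int)

pipe = Pipeline([('scl', StandardScaler()),
                 ('clf', LogisticRegression(penalty='l2', random_state=0))])
param_grid = {'clf__C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}  # same range as the validation curve
gs = GridSearchCV(pipe, param_grid, cv=10)
gs.fit(X_demo, y_demo)
print('Best C:', gs.best_params_['clf__C'])
print('Best CV accuracy: %.3f' % gs.best_score_)
```

The `clf__C` name follows the pipeline step naming already used above, so the grid search tunes exactly the parameter the validation curve plots.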